Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Change the location and filename of schema.pbtxt to .merlin/schema.json #249

Merged
merged 3 commits into from
Mar 17, 2023

Conversation

edknv
Copy link
Contributor

@edknv edknv commented Mar 16, 2023

Fixes NVIDIA-Merlin/Merlin#845.

This PR moves Merlin-specific metadata files, such as schema, to a sub-directory .merlin in an effort to standardize the location and file format of Merlin schema. The format of the schema file is also changed to JSON.

Models Tests (CPU) in the GitHub Actions CI is failing, but this is expected as Models hosts the dataset tests. PR for addressing this: NVIDIA-Merlin/models#1021

@edknv edknv self-assigned this Mar 16, 2023
@edknv edknv added the chore Maintenance for the repository label Mar 16, 2023
@edknv edknv added this to the Merlin 23.03 milestone Mar 16, 2023
@edknv edknv marked this pull request as ready for review March 16, 2023 18:50
assert os.path.join(path, "_metadata") in data_file_list
metadata_file_list = glob.glob(os.path.join(path, ".merlin", "*"))
assert os.path.join(path, ".merlin", "_file_list.txt") in metadata_file_list
assert os.path.join(path, ".merlin", "_metadata.json") in metadata_file_list
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rjzamora This PR looks good to me, but I wanted to run these files by you before we move them. Can these move to the .merlin directory without breaking things that depend on them?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@karlhigley Thanks for catching this. I think these are metadata files for hugeCTR, and I was maybe thinking hugeCTR is also Merlin so... After double-checking some of hugeCTR code, I think moving those files might break hugeCTR. I reverted those changes in a4c7cef to be safe so that these are still in the same directory as the parquet files, and only schema.json goes into .merlin.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with you that in the longer term they probably should get consolidated into the Merlin metadata directory, but this does seem safer for now 🤝

@karlhigley karlhigley requested a review from rjzamora March 16, 2023 22:15
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
breaking Breaking change chore Maintenance for the repository
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Standardize the location and file format of schemas across the Merlin ecosystem
3 participants